fix: software value failing for large repos [CM-1029] #3947
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Pull request overview
This PR improves reliability of the software value calculation for very large repositories by adding an option to skip very large files during SCC analysis, and enabling that option automatically for large repos in the Python worker.
Changes:
- Added a `--no-large` CLI flag to the Go `software-value` binary and propagated it through SCC invocations.
- Updated Python `SoftwareValueService` to compute repository disk usage and automatically add `--no-large` for repos ≥ 10GB.
- Minor robustness/clarity improvements (usage text, error message cleanup).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| services/apps/git_integration/src/crowdgit/services/software_value/software_value_service.py | Adds repo-size detection and conditionally appends --no-large to the binary invocation. |
| services/apps/git_integration/src/crowdgit/services/software_value/main.go | Introduces --no-large flag and passes it through to SCC execution (including large-file threshold args). |
6685eac to b5f1207
    func runSCC(sccPathPath string, noLarge bool, args ...string) (string, error) {
        var cmdArgs []string
        if noLarge {
            cmdArgs = append(cmdArgs, "--no-large", "--large-byte-count", "100000000")
Missing --large-line-count override causes unintended file exclusion
Medium Severity
When --no-large is enabled, scc filters files exceeding either the byte count or the line count threshold. The code sets --large-byte-count to 100000000 but does not set --large-line-count, so scc's default of 40000 lines applies. This means source files with more than 40000 lines (but well under 100MB) are silently excluded from the COCOMO cost calculation, understating the software value — even though the stated intent is only to skip files larger than 100MB.
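One way to address this comment is to pass both thresholds whenever `--no-large` is set. The following is a minimal Python sketch of that argument construction, not the PR's Go code: `build_scc_args` and both constants are hypothetical names, and the line-count value is a placeholder chosen only for illustration, since the PR does not specify one.

```python
from typing import List, Optional

# Byte threshold taken from the PR discussion (100 MB).
LARGE_BYTE_COUNT = 100_000_000
# Hypothetical placeholder; the PR does not state a line-count threshold.
LARGE_LINE_COUNT = 10_000_000


def build_scc_args(
    no_large: bool,
    extra_args: List[str],
    large_line_count: Optional[int] = LARGE_LINE_COUNT,
) -> List[str]:
    """Build the scc argument list, overriding both large-file thresholds."""
    args: List[str] = []
    if no_large:
        # scc drops a file if it exceeds EITHER threshold, so set both
        # explicitly instead of relying on the default line count.
        args += ["--no-large", "--large-byte-count", str(LARGE_BYTE_COUNT)]
        if large_line_count is not None:
            args += ["--large-line-count", str(large_line_count)]
    return args + list(extra_args)
```

Passing `large_line_count=None` reproduces the behavior described in the comment, where only the byte threshold is overridden and scc's default line limit still applies.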
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>


This pull request adds support for handling very large repositories in the software value calculation service. The main change is the introduction of a `--no-large` flag that, when enabled, skips files larger than 100MB to prevent out-of-memory errors during analysis. The Python service now automatically enables this flag for repositories of 10GB or more, improving reliability for large codebases. Several functions in the Go codebase are updated to propagate and handle this flag.

Large repository handling:
- Added a `--no-large` command-line flag to the Go binary (`main.go`) to skip files larger than 100MB when analyzing repositories, preventing OOM errors on large repos. This flag is propagated through all relevant functions and passed to the `scc` tool.
- In the Python service (`software_value_service.py`), added logic to check the repository size before running the Go binary. If the repo is 10GB or larger, the `--no-large` flag is automatically added to the command invocation.

Code robustness and clarity:
- … `du -sb`.

Note
Medium Risk
Changes the software value pipeline to conditionally skip large files and bypass analysis for specific repos, which can alter reported metrics and relies on new size-detection shelling out (`du`).

Overview
Improves software value analysis robustness for very large repositories by adding a `--no-large` flag to the Go `software-value` binary and propagating it through SCC execution (skipping files >100MB via `scc --no-large --large-byte-count 100000000`).

Updates the Python `SoftwareValueService` to (1) skip a hardcoded excluded repo ID entirely and (2) automatically enable `--no-large` when repo disk usage (via `du -sb`) is >= 10GB, plus minor usage/error-message cleanup in the Go binary.

Written by Cursor Bugbot for commit 85a3030. This will update automatically on new commits.
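The size check described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the code in `software_value_service.py`: the function names are hypothetical, the 10GB threshold is assumed to mean 10 GiB, and placing the flag before the repository path is a guess about how the Go binary parses its arguments.

```python
import subprocess
from typing import List

# Assumption: "10GB" is interpreted as 10 GiB; the PR does not say which.
LARGE_REPO_THRESHOLD_BYTES = 10 * 1024**3


def parse_du_output(output: str) -> int:
    """Extract the byte count from `du -sb` output ("<bytes>\\t<path>")."""
    return int(output.split()[0])


def get_repo_size_bytes(repo_path: str) -> int:
    """Return the apparent size of repo_path in bytes (requires GNU du)."""
    result = subprocess.run(
        ["du", "-sb", repo_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_du_output(result.stdout)


def build_command(binary_path: str, repo_path: str, repo_size_bytes: int) -> List[str]:
    """Build the software-value invocation, adding --no-large for big repos."""
    cmd = [binary_path]
    if repo_size_bytes >= LARGE_REPO_THRESHOLD_BYTES:
        # Hypothetical ordering: flag placed before the positional path.
        cmd.append("--no-large")
    cmd.append(repo_path)
    return cmd
```

Keeping the parsing and command construction separate from the `du` call lets the threshold logic be exercised without touching the filesystem. Note that `-b` is a GNU coreutils option, so this sketch would not work unchanged with BSD or macOS `du`.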